In this project we will build a model to predict the median housing price in any district of California.
We will use the California Housing Prices dataset from the StatLib repository. This dataset is based on data from the 1990 California census. It is not exactly recent, but it has many qualities for learning, so we will pretend it is recent data.
Since we are trying to predict a value, this is a regression problem.
A typical performance measure for regression is the RMSE (Root Mean Square Error). It gives an idea of the typical size of the errors the system makes in its predictions. For example, assuming roughly Gaussian errors, an RMSE of $50,000 means that about 68% of the system's predictions fall within $50,000 of the actual value, and about 95% of the predictions fall within $100,000 of the actual value,
with:

$\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(\mathbf{x}^{(i)}) - y^{(i)}\right)^2}$

The MAE (Mean Absolute Error) corresponds to the Manhattan distance (ℓ1 norm): it averages the absolute value of each error, instead of squaring the errors and taking the square root as the RMSE does:

$\mathrm{MAE}(\mathbf{X}, h) = \frac{1}{m}\sum_{i=1}^{m}\left|h(\mathbf{x}^{(i)}) - y^{(i)}\right|$
Even though the RMSE is generally the preferred performance measure for regression tasks, in some contexts you may prefer to use another function. For example, suppose that there are many outlier districts. In that case, you may consider using the Mean Absolute Error (also called the Average Absolute Deviation).
RMSE is more sensitive to outliers than the MAE. But when outliers are exponentially rare (like in a bell-shaped curve), the RMSE performs very well and is generally preferred.
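As a quick illustration (toy numbers, not from the housing data), the following sketch shows how a single outlier inflates the RMSE much more than the MAE:

```python
import numpy as np

def rmse(errors):
    # root mean square error: sqrt of the mean of squared errors (l2-style)
    return np.sqrt(np.mean(np.square(errors)))

def mae(errors):
    # mean absolute error: mean of absolute errors (l1-style)
    return np.mean(np.abs(errors))

errors = np.array([10.0, -12.0, 8.0, -9.0, 11.0])  # typical errors
errors_out = np.append(errors, 200.0)              # add one large outlier

print(rmse(errors), mae(errors))          # similar magnitudes (~10)
print(rmse(errors_out), mae(errors_out))  # RMSE inflates far more than MAE
```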
The data is stored in a remote data store.
To use it, we can either download it manually, or write a function that downloads the tarball, decompresses it, and extracts the CSV.
The second approach is very useful for building automated processes.
import os
import tarfile
import urllib.request  # six.moves is unnecessary on Python 3

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # create the target directory if needed (os.makedirs, not os.path.mkdirs)
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, 'housing.tgz')
    urllib.request.urlretrieve(housing_url, tgz_path)
    # decompress the archive and extract the CSV
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)
import pandas as pd
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)
#fetch_housing_data()
housing = load_housing_data()
housing.head()
housing.info()
housing.ocean_proximity.value_counts()
display(housing.describe())
display(housing.describe(include='O'))
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(30,25))
plt.show()
From these histograms we can observe several things:
1: Some attributes have a strong spike at the far right. After studying the description of the data, we understand that these attributes have been capped. It is the case, most notably, for housing_median_age and median_house_value.
2: The attributes have very different scales.
3: Many distributions extend far to the right; they are said to be tail-heavy. This characteristic makes it harder for some models to detect patterns, so we may later transform these attributes to make their distributions more bell-shaped (Gaussian).
We will split the dataset into a training set (80%) and a test set (20%).
There are many ways to sample a dataset. We can sample it randomly with this simple one-liner from Scikit-Learn:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
This is considered a sufficient method when the dataset is large enough. If it is not, we run the risk of introducing a significant sampling bias. The sample needs to be representative of the whole population: if in the real world there are, for example, 51.3% females and 48.7% males, a sample of size 1,000 needs to maintain that ratio, so it should contain 513 females and 487 males. This is called stratified sampling.
Experts told us that the median income is a very important attribute. So we want to ensure that the test set is representative of the various income categories in the whole dataset.
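Note that train_test_split itself supports stratification via its stratify parameter; a minimal sketch on a hypothetical female/male population:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical population: 513 females and 487 males
pop = pd.DataFrame({"sex": ["F"] * 513 + ["M"] * 487})

# stratify= preserves the F/M ratio in both splits
train, test = train_test_split(pop, test_size=0.2, stratify=pop["sex"],
                               random_state=42)
print(test["sex"].value_counts(normalize=True))
```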
housing.median_income.hist()
train_set.median_income.hist()
test_set.median_income.hist()
The proportions look similar. But we can see that most of the values are concentrated between 2 and 5, with relatively few elements outside this interval. So we will create an income category attribute and use it for stratified sampling.
import numpy as np
def processing_income_cat(df):
    # divide the median income by 1.5 to limit the number of categories,
    # then cap everything above category 5
    df["income_cat"] = np.ceil(df["median_income"] / 1.5)
    df["income_cat"].where(df["income_cat"] < 5, 5.0, inplace=True)
processing_income_cat(housing)
housing["income_cat"].hist()
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
Now the dataset has been split into two subsets with a stratified method. Let's see if the proportions have been preserved.
housing["income_cat"].value_counts()/len(housing)
processing_income_cat(test_set)
errors = pd.DataFrame()
errors['Overall'] = (housing["income_cat"].value_counts()/len(housing)).sort_index()
errors['Random'] = (test_set["income_cat"].value_counts()/len(test_set)).sort_index()
errors['Stratified'] = (strat_test_set["income_cat"].value_counts()/len(strat_test_set)).sort_index()
errors
def error_counting(df):
    # add, for each sampling method, its percentage error relative to
    # the overall proportions (first column)
    for col in df.columns[1:]:
        df[col + ' % error'] = ((df[col] - df.iloc[:, 0]) / df.iloc[:, 0]) * 100
error_counting(errors)
errors
As we can see, the test set generated using stratified sampling has income category proportions almost identical to those in the full dataset, whereas the test set generated using purely random sampling is quite skewed.
Now we should remove the income_cat attribute so the data is back to its original state:
for set_ in (strat_test_set, strat_train_set):  # avoid shadowing the built-in set
    set_.drop(columns='income_cat', inplace=True)
Now that we've taken a quick glance at the data, it's time to look into it more deeply.
To do so, we will explore the training set. When the training set is very large, we may sample an exploration set to make the manipulations simpler. Here, as our training set is quite small, we don't need this additional step. But to keep our training set intact, we'll work on a copy of it.
housing = strat_train_set.copy()
housing.columns
housing.plot(kind='scatter',x='longitude',y='latitude')
Ok. So far the plot looks like California, but it's hard to see anything else. To reveal the high-density areas, we'll set alpha to 0.1.
housing.plot(kind='scatter',x='longitude',y='latitude',alpha=0.1)
We can see a concentration along the coast and in some inland areas. But as I don't know California very well, let's use another tool.
import plotly.express as px
fig = px.scatter_mapbox(housing, lat="latitude", lon="longitude",
color_continuous_scale=px.colors.cyclical.IceFire, size_max=15, zoom=10,
mapbox_style="carto-positron",opacity=0.1)
fig.show()
That's a lot better. We can clearly see the high-density areas around the big cities, the downtowns, and famous areas, namely San Francisco, San Jose, Long Beach, San Diego, Fresno, Sacramento, etc.
Now let's add the house prices and the population size to the visualization, to bring out more patterns.
housing.plot(kind='scatter',x='longitude',y='latitude',alpha=0.4,
s=housing['population']/100, label='population',
c="median_house_value",cmap=plt.get_cmap("jet"),
colorbar=True,figsize=(15,10))
import plotly.express as px
fig = px.scatter_mapbox(housing, lat="latitude", lon="longitude",
color_continuous_scale=px.colors.cyclical.IceFire, zoom=10,
mapbox_style="carto-positron",opacity=0.4,
size='population',size_max=15, color='median_house_value')
fig.show()
This image tells us that the housing prices are very much related to the location (e.g., close to the ocean) and to the population density, as we probably knew already.
It will probably be useful to use a clustering algorithm to detect the main clusters, and add new features that measure the proximity to the cluster centers. The ocean proximity attribute may be useful as well, although in Northern California the housing prices in coastal districts are not too high, so it is not a simple rule.
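As a rough sketch of this clustering idea (synthetic coordinates and a hypothetical cluster count, not the real districts), KMeans can provide a distance-to-nearest-center feature:

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic (longitude, latitude) pairs standing in for district coordinates
rng = np.random.RandomState(42)
coords = rng.uniform(low=[-124.0, 32.0], high=[-114.0, 42.0], size=(200, 2))

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(coords)
# transform() returns the distance from each sample to every cluster center;
# the minimum over centers is a candidate new feature
dist_to_center = kmeans.transform(coords).min(axis=1)
print(dist_to_center.shape)  # (200,) -> one new feature per district
```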
Since the dataset is not too large, we can easily compute the standard correlation coefficient (Pearson's r) between every pair of attributes.
corr_matrix = housing.corr()
corr_matrix['median_house_value'].sort_values(ascending=False)
The correlation coefficient only measures linear correlations ("if x goes up, then y generally goes up/down"). It may completely miss nonlinear relationships (e.g., "if x is close to zero then y generally goes up"): two attributes can have a correlation coefficient of zero even though they are clearly not independent.
Another way to check the correlation between attributes is to plot every numerical attribute against every other numerical attribute.
As there are 11 numerical attributes, plotting each pair would result in 11² = 121 plots, which is too many. We will focus on the most promising ones.
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms",
"housing_median_age"]
scatter_matrix(housing[attributes], figsize=(15, 10))
The most promising attribute to predict the median house value is the median income.
housing.plot(kind="scatter", x="median_income", y="median_house_value",
alpha=0.1)
This plot indeed reveals the strong correlation between median income and median house value. But it shows a few more things too. We can clearly see the price cap as a horizontal line at $500,000, and the plot reveals other, fainter horizontal lines at lower values.
We may want to try removing the corresponding districts to prevent the algorithms from learning to reproduce these data quirks.
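A possible way to do this (a sketch on toy values; the exact cap, taken here to be $500,001, should be checked against the real data):

```python
import pandas as pd

# toy frame with two rows sitting at the price cap
df = pd.DataFrame({"median_house_value": [150000, 500001, 320000, 500001]})

# keep only the districts below the cap
uncapped = df[df["median_house_value"] < 500001]
print(len(uncapped))  # 2
```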
housing.columns
We've seen that some attributes here don't have much correlation with the median house value or much meaning on their own, for example total_rooms, total_bedrooms, population, and households.
By combining these attributes we can create new features with stronger meaning.
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]
housing.corr()["median_house_value"].sort_values(ascending=False)
The new bedrooms_per_room attribute is much more correlated with the median house value than the total number of rooms or bedrooms. Apparently houses with a lower bedroom/room ratio tend to be more expensive. The number of rooms per household is also more informative than the total number of rooms in a district—obviously the larger the houses, the more expensive they are.
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()
Most ML algorithms can't work with missing values, so we have to handle them first. We have three options: get rid of the corresponding districts, get rid of the whole attribute, or set the missing values to some value (zero, the mean, the median, etc.).
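The three options can be sketched on a toy frame (hypothetical values):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"total_bedrooms": [2.0, np.nan, 4.0],
                    "households": [3, 5, 7]})

opt1 = toy.dropna(subset=["total_bedrooms"])   # 1. drop the districts
opt2 = toy.drop("total_bedrooms", axis=1)      # 2. drop the whole attribute
median = toy["total_bedrooms"].median()        # 3. fill with the median
opt3 = toy.fillna({"total_bedrooms": median})

print(len(opt1), opt2.shape[1], median)  # 2 1 3.0
```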
In our dataset, only one attribute has missing values: total_bedrooms. We will fill these values with the median.
To do so, we'll use a handy class provided by Scikit-Learn that computes the median of every numerical attribute. This is useful because we cannot be sure that there won't be any missing values in new data after the system goes live, so it is safer to apply the imputer to all the numerical attributes.
from sklearn.preprocessing import Imputer  # in Scikit-Learn >= 0.22, use sklearn.impute.SimpleImputer instead
imputer = Imputer(strategy="median")
#The median can be computed only on numerical attributes, so we drop the only categorical attribute
housing_num = housing.drop("ocean_proximity", axis=1)
# fit the imputer instance on the training data
imputer.fit(housing_num)
print(imputer.statistics_)
print(housing_num.median().values)
# Now we can use the trained imputer to transform the training set
X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns)
Most ML models prefer to work with numbers, so let's convert the text labels of ocean_proximity to numbers.
Scikit-Learn provides a transformer for this task called LabelEncoder
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
housing_cat = housing['ocean_proximity']
housing_cat_encoded = encoder.fit_transform(housing_cat)
print(housing_cat_encoded)
print(housing_cat.values)
print(encoder.classes_)
This representation is not suited to the nature of this categorical attribute: with these values, the ML algorithm will assume that two nearby values are more similar than two distant values, which is not the case here.
So we'll use another method: one-hot encoding.
print(housing_cat_encoded.shape)
print(housing_cat_encoded.reshape(-1,1).shape)
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1,1))
housing_cat_1hot
Instead of transforming the text to integers and then to one-hot vectors, we can do both in one shot using the LabelBinarizer class.
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
housing_cat_1hot = encoder.fit_transform(housing_cat)
housing_cat_1hot
# We can get a sparse matrix instead by passing sparse_output=True
# To the LabelBinarizer constructor
Although Scikit-Learn provides many useful transformers, we will sometimes have to write our own. To do so, we create a class implementing three methods: fit(), transform(), and fit_transform(). We can get the last one for free by adding TransformerMixin as a base class.
Also, if we add BaseEstimator as a base class (and avoid *args and **kargs in the constructor), we get two extra methods (get_params() and set_params()) that will be useful for automatic hyperparameter tuning.
# we will build a class that adds the attributes that we computed earlier
from sklearn.base import BaseEstimator, TransformerMixin
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6  # column indices in housing_num
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
Feature scaling is one of the most important transformations we need to apply, because, with few exceptions, ML algorithms don't perform well when the numerical attributes have very different scales.
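The two common options, min-max scaling and standardization, can be sketched on toy values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# two toy attributes with very different scales
X = np.array([[1.0, 10000.0],
              [2.0, 20000.0],
              [3.0, 30000.0]])

X_minmax = MinMaxScaler().fit_transform(X)  # each column squashed into [0, 1]
X_std = StandardScaler().fit_transform(X)   # each column: zero mean, unit variance

print(X_minmax.min(axis=0), X_minmax.max(axis=0))  # [0. 0.] [1. 1.]
print(X_std.mean(axis=0).round(6))                 # [0. 0.]
```

Standardization is less affected by outliers than min-max scaling, which is one reason it is often preferred.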
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return X[self.attribute_names]
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import StandardScaler
num_attribs = list(housing_num)
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])
from sklearn.preprocessing import LabelBinarizer
cat_attribs = ["ocean_proximity"]
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', LabelBinarizer())
])
from sklearn.pipeline import FeatureUnion
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])
housing_prepared = full_pipeline.fit_transform(housing)
This raises an error in the LabelBinarizer fit_transform method: the Pipeline calls fit_transform(X, y), but the signature changed from
def fit_transform(self, x, y)
...rest of the code
to
def fit_transform(self, x)
...rest of the code
so LabelBinarizer can no longer be used directly inside a Pipeline. We can work around this with a small wrapper class:
class NewLabelBinarizer(LabelBinarizer):
    # accept (and ignore) an optional y argument so it can be used in a Pipeline
    def fit(self, X, y=None):
        return super(NewLabelBinarizer, self).fit(X)
    def transform(self, X, y=None):
        return super(NewLabelBinarizer, self).transform(X)
    def fit_transform(self, X, y=None):
        return super(NewLabelBinarizer, self).fit(X).transform(X)
from sklearn.preprocessing import LabelBinarizer
cat_attribs = ["ocean_proximity"]
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', NewLabelBinarizer())
])
cat_pipeline.fit_transform(housing)
# compare with the LabelBinarizer result
(housing_cat_1hot != cat_pipeline.fit_transform(housing)).sum()
The new NewLabelBinarizer class produces the same output as LabelBinarizer, but it can be used inside a Pipeline.
from sklearn.pipeline import FeatureUnion
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])
# executing the whole pipeline
housing_prepared = full_pipeline.fit_transform(housing)
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
Knowing that most districts' median_housing_values range between $120,000 and $265,000, a typical prediction error of $68,628 is not satisfying at all.
The model is underfitting the training data. It means that either the features do not provide enough information to make good predictions, or the model is not powerful enough.
There are three main ways to fix underfitting: select a more powerful model, feed the algorithm better features, or reduce the constraints on the model (regularization).
This model is not regularized, so this rules out the last option. We can try to add more features (e.g., the log of the population), but first let’s try a more complex model to see how it does.
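For reference, the log-of-population feature mentioned above would look like this (a sketch on hypothetical values; whether it actually helps must be verified by cross-validation):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"population": [300, 1200, 9000, 35000]})
# a log transform shortens the heavy right tail of the distribution
toy["log_population"] = np.log(toy["population"])
print(toy["log_population"].round(2).tolist())  # monotonic, much narrower range
```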
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)
from sklearn.metrics import mean_squared_error
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse
The model has clearly overfit the data. To confirm this, we will use cross-validation.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg,housing_prepared, housing_labels,
scoring='neg_mean_squared_error', cv=10)
rmse_scores = np.sqrt(-scores)
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())
display_scores(rmse_scores)
Now the Decision Tree doesn’t look as good as it did earlier. In fact, it seems to perform worse than the Linear Regression model!
The Decision Tree has a mean RMSE of approximately 71,200, typically ±3,200.
lin_scores = cross_val_score(lin_reg,housing_prepared, housing_labels,
scoring='neg_mean_squared_error', cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)
That’s right: the Decision Tree model is overfitting so badly that it performs worse than the Linear Regression model.
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)
from sklearn.metrics import mean_squared_error
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse
forest_scores = cross_val_score(forest_reg,housing_prepared, housing_labels,
scoring='neg_mean_squared_error', cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)
The results are far better than with the two previous models. However, the score on the training set is much lower than the cross-validation scores, which means that the model is still overfitting.
from sklearn.svm import SVR
svr_rbf = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=.1)
svr_lin = SVR(kernel='linear', C=100, gamma='auto')
svr_poly = SVR(kernel='poly', C=100, gamma='auto', degree=3, epsilon=.1,
coef0=1)
svr_rbf.fit(housing_prepared, housing_labels)
svr_lin.fit(housing_prepared, housing_labels)
svr_poly.fit(housing_prepared, housing_labels)
housing_predictions = svr_rbf.predict(housing_prepared)
svr_rbf_mse = mean_squared_error(housing_labels, housing_predictions)
svr_rbf_rmse = np.sqrt(svr_rbf_mse)
print(svr_rbf_rmse)
housing_predictions = svr_lin.predict(housing_prepared)
svr_lin_mse = mean_squared_error(housing_labels, housing_predictions)
svr_lin_rmse = np.sqrt(svr_lin_mse)
print(svr_lin_rmse)
housing_predictions = svr_poly.predict(housing_prepared)
svr_poly_mse = mean_squared_error(housing_labels, housing_predictions)
svr_poly_rmse = np.sqrt(svr_poly_mse)
print(svr_poly_rmse)
svr_rbf_scores = cross_val_score(svr_rbf,housing_prepared, housing_labels,
scoring='neg_mean_squared_error', cv=10)
svr_rbf_rmse_scores = np.sqrt(-svr_rbf_scores)
display_scores(svr_rbf_rmse_scores)
svr_lin_scores = cross_val_score(svr_lin,housing_prepared, housing_labels,
scoring='neg_mean_squared_error', cv=10)
svr_lin_rmse_scores = np.sqrt(-svr_lin_scores)
display_scores(svr_lin_rmse_scores)
svr_poly_scores = cross_val_score(svr_poly,housing_prepared, housing_labels,
scoring='neg_mean_squared_error', cv=10)
svr_poly_rmse_scores = np.sqrt(-svr_poly_scores)
display_scores(svr_poly_rmse_scores)
The results for the SVMs are even worse than the first model's.
The Random Forest gave us the best result, so we will focus on this one.
The most important hyperparameter is the number of features to consider at each split (max_features).
You could try a range of integer values, such as 1 to 20, or 1 to half the number of input features:
max_features in [1 .. 20]
Alternatively, you could try the built-in heuristics:
max_features in ['sqrt', 'log2']
Another important hyperparameter for a random forest is the number of trees (n_estimators).
Ideally, this should be increased until no further improvement is seen in the model.
Good values might lie on a log scale from 10 to 1,000:
n_estimators in [10, 100, 1000]
For the full list of hyperparameters, see the sklearn.ensemble.RandomForestRegressor API.
from sklearn.model_selection import GridSearchCV
param_grid = [
{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
{'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)
grid_search.best_params_
Since 8 is the maximum value of max_features and 30 the maximum value of n_estimators that were evaluated, the best combination may lie beyond the grid, so we should evaluate higher values.
param_grid = [
{'n_estimators': [30, 40, 50], 'max_features': [6, 8, 10, 12]},
{'bootstrap': [False], 'n_estimators': [10, 12], 'max_features': [6, 8, 10]},
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)
grid_search.best_params_
grid_search.best_estimator_
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
print(np.sqrt(-mean_score), params)
max_features converges between 6 and 8.
n_estimators seems to keep improving as it grows. By refining the hyperparameters, the model's RMSE drops from 52,456 to 49,431.
To get better hyperparameters, we will use RandomizedSearchCV, which can explore a larger distribution of values.
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
param_distrib = {'n_estimators': np.logspace(2, 3, 15, endpoint=True).astype(int).tolist(),
'max_features': [6, 7, 8]}
forest_reg = RandomForestRegressor()
rnd_search = RandomizedSearchCV(forest_reg, param_distrib, n_iter=21, cv=5,
scoring='neg_mean_squared_error',verbose=2,
n_jobs=-1, random_state=42)
rnd_search.fit(housing_prepared,housing_labels)
rnd_search.best_params_
negative_mse = rnd_search.best_score_
rmse = np.sqrt(-negative_mse)
rmse
This RMSE is better than the one found with the grid search.
feature_importances = rnd_search.best_estimator_.feature_importances_
feature_importances
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_one_hot_attribs = list(encoder.classes_)
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
f_impr = sorted(zip(feature_importances, attributes), reverse=True)
final_model = rnd_search.best_estimator_
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()
X_test_prepared = full_pipeline.transform(X_test)
from sklearn.metrics import mean_squared_error
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse
from scipy import stats
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
mean = squared_errors.mean()
m = len(squared_errors)
np.sqrt(stats.t.interval(confidence, m - 1,
loc=np.mean(squared_errors),
scale=stats.sem(squared_errors)))
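The same interval can be computed by hand from the t critical value (a sketch on synthetic squared errors, not the real test-set errors):

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(42)
squared_errors = rng.exponential(scale=2e9, size=100)  # synthetic stand-ins

confidence = 0.95
m = len(squared_errors)
mean = squared_errors.mean()
# two-sided t critical value, then margin of error on the mean squared error
tscore = stats.t.ppf((1 + confidence) / 2, df=m - 1)
tmargin = tscore * squared_errors.std(ddof=1) / np.sqrt(m)
# taking square roots turns the MSE interval into an RMSE interval
lo, hi = np.sqrt(mean - tmargin), np.sqrt(mean + tmargin)
print(np.sqrt(mean), (lo, hi))  # RMSE point estimate and its 95% interval
```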
# the best features
feature_importance = rnd_search.best_estimator_.feature_importances_
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_one_hot_attribs = list(encoder.classes_)
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
class feature_importance_selection(BaseEstimator, TransformerMixin):
    def __init__(self, feature_importances):
        # store the importances; f_impr pairs each importance with its attribute name
        self.feature_importances = feature_importances
        self.f_impr = sorted(zip(feature_importances, attributes), reverse=True)
from sklearn.externals import joblib  # in Scikit-Learn >= 0.23, simply `import joblib`
joblib.dump(final_model, "datasets/housing/final_model.pkl")
#my_model_loaded = joblib.load("datasets/housing/final_model.pkl")